Section 1. Problem Domain Description

COVID-19, short for coronavirus disease 2019, is the illness caused by the SARS-CoV-2 virus, first identified in Wuhan, China, in December 2019. The virus has since spread rapidly across the world, leading the World Health Organization to declare a global pandemic. Millions of Americans have been infected, and hundreds of thousands have died from the disease, with those numbers continuing to grow each day. A global race to develop a vaccine in record time ensued, with over 100 candidates being tested across the globe. Although multiple vaccines have received emergency authorization in several nations, the situation continues to worsen as new variants are identified, such as those found in the United Kingdom. In the United States, public health officials are struggling to convince the populace that the vaccines are safe and effective, and widespread anti-vaccine protests seek to slow vaccination efforts, which only gives the virus more time to develop a mutation that defeats the current vaccine formulations.

Analyzing data related to COVID-19 is therefore worthwhile: it helps people understand the overall situation and severity of the pandemic and may encourage them to adopt protective measures such as mask-wearing, social distancing, and vaccination. In addition, analyzing this data may expose differences in how well the regulations of different states contain the virus, which may help state governments ensure that they use only restrictions that truly work to contain this pathogen.

Section 2. Data Description

JHU CSSE COVID-19 Data

The COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University is compiled from sources such as, but not limited to, the World Health Organization and the United States Centers for Disease Control and Prevention (a list of all data sources is provided in the README.md file of the repository). It provides case and death counts for each county or political subdivision of every state and U.S. territory for each day since the SARS-CoV-2 virus was first detected in Washington state in January of 2020. The data set is known for providing some of the most up-to-date information available, and many organizations cite it as trustworthy and reliable.

UID       iso2  iso3  code3  FIPS  Admin2   Province_State  Country_Region  Lat       Long_      Combined_Key          X1.22.20
84001001  US    USA   840    1001  Autauga  Alabama         US              32.53953  -86.64408  Autauga, Alabama, US  0
84001003  US    USA   840    1003  Baldwin  Alabama         US              30.72775  -87.72207  Baldwin, Alabama, US  0
84001005  US    USA   840    1005  Barbour  Alabama         US              31.86826  -85.38713  Barbour, Alabama, US  0
84001007  US    USA   840    1007  Bibb     Alabama         US              32.99642  -87.12511  Bibb, Alabama, US     0
84001009  US    USA   840    1009  Blount   Alabama         US              33.98211  -86.56791  Blount, Alabama, US   0
84001011  US    USA   840    1011  Bullock  Alabama         US              32.10031  -85.71266  Bullock, Alabama, US  0

UID       iso2  iso3  code3  FIPS  Admin2   Province_State  Country_Region  Lat       Long_      Combined_Key          Population
84001001  US    USA   840    1001  Autauga  Alabama         US              32.53953  -86.64408  Autauga, Alabama, US  55869
84001003  US    USA   840    1003  Baldwin  Alabama         US              30.72775  -87.72207  Baldwin, Alabama, US  223234
84001005  US    USA   840    1005  Barbour  Alabama         US              31.86826  -85.38713  Barbour, Alabama, US  24686
84001007  US    USA   840    1007  Bibb     Alabama         US              32.99642  -87.12511  Bibb, Alabama, US     22394
84001009  US    USA   840    1009  Blount   Alabama         US              33.98211  -86.56791  Blount, Alabama, US   57826
84001011  US    USA   840    1011  Bullock  Alabama         US              32.10031  -85.71266  Bullock, Alabama, US  10101

Data Set Features of Note
  • Admin2: name of county/political subdivision of U.S. state/territory
  • Province_State: name of U.S. state/territory
  • Xmm.dd.yy: one feature per day since the SARS-CoV-2 virus was first detected in the United States, representing the case/death count of the county/political subdivision defined by the Admin2 feature; takes the format Xmm.dd.yy, where mm is the one- or two-digit month as a decimal, dd is the one- or two-digit day of the month as a decimal, and yy is the two-digit year without century as a decimal
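
The Xmm.dd.yy format described above can be turned into real dates by stripping the leading X and parsing the remainder. The sketch below is written in Python purely for illustration; the preview tables look like R output, and the X prefix is presumably what R adds when it sanitizes column names that begin with a digit. The helper names parse_date_column and is_date_column are hypothetical and are not part of the data set or of any library.

```python
from datetime import date, datetime


def parse_date_column(name: str) -> date:
    """Convert a column name such as 'X1.22.20' into a date object."""
    if not name.startswith("X"):
        raise ValueError(f"not a date column: {name!r}")
    # When parsing, %m and %d accept one- or two-digit values;
    # %y is the two-digit year without century.
    return datetime.strptime(name[1:], "%m.%d.%y").date()


def is_date_column(name: str) -> bool:
    """Return True when the column name follows the Xmm.dd.yy pattern."""
    try:
        parse_date_column(name)
    except ValueError:
        return False
    return True


# A shortened header; "X12.5.20" is an illustrative later date column.
header = ["UID", "Admin2", "Province_State", "Population", "X1.22.20", "X12.5.20"]
date_columns = [column for column in header if is_date_column(column)]

print(date_columns)                   # ['X1.22.20', 'X12.5.20']
print(parse_date_column("X1.22.20"))  # 2020-01-22
```

Separating the date columns from the identifying columns this way makes it straightforward to reshape the wide table into one row per county per day.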

HIFLD Hospitals

The Homeland Infrastructure Foundation-Level Data Hospitals (HIFLD Hospitals) data set, published by the United States Department of Homeland Security and compiled from United States Department of Health & Human Services and Centers for Disease Control and Prevention sources, provides a list of hospitals in the United States and their associated trauma level. It identifies how many hospitals of each type exist in each state.

NAME                                                STATE  TYPE                BEDS  TRAUMA
CENTRAL VALLEY GENERAL HOSPITAL                     CA     GENERAL ACUTE CARE  49    NA
LOS ROBLES HOSPITAL & MEDICAL CENTER - EAST CAMPUS  CA     GENERAL ACUTE CARE  62    NA
EAST LOS ANGELES DOCTORS HOSPITAL                   CA     GENERAL ACUTE CARE  127   NA
SOUTHERN CALIFORNIA HOSPITAL AT HOLLYWOOD           CA     GENERAL ACUTE CARE  100   NA
KINDRED HOSPITAL BALDWIN PARK                       CA     GENERAL ACUTE CARE  95    NA
LAKEWOOD REGIONAL MEDICAL CENTER                    CA     GENERAL ACUTE CARE  172   NA

Data Set Features of Note
  • STATE: two-letter U.S.P.S. abbreviation of state name
  • TYPE: type of hospital; value can be "GENERAL ACUTE CARE", "CRITICAL ACCESS", "PSYCHIATRIC", "LONG TERM CARE", "REHABILITATION", "MILITARY", "SPECIAL", "CHILDREN", "WOMEN", or "CHRONIC DISEASE"
  • STATUS: current status of hospital; value either "OPEN" or "CLOSED"
  • LATITUDE: latitude of hospital
  • LONGITUDE: longitude of hospital
  • BEDS: number of beds available at hospital; value of -999 represents an unknown count of beds
  • TRAUMA: non-standard trauma center level identifier (definitions can be found in the HIFLD Trauma Levels Data Set); value of "NOT AVAILABLE" indicates the hospital is not classified as a trauma center
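
Because BEDS uses -999 for an unknown count and TRAUMA uses "NOT AVAILABLE" for hospitals that are not trauma centers, both sentinels need to be handled before any averages or counts are computed. The following Python sketch shows one way this could be done; the function names and the three example rows are hypothetical and are not taken from the data set.

```python
from typing import Optional

UNKNOWN_BEDS = -999  # sentinel for an unknown bed count, per the feature list


def clean_beds(raw: int) -> Optional[int]:
    """Return the bed count, or None when the source marks it as unknown."""
    return None if raw == UNKNOWN_BEDS else raw


def is_trauma_center(trauma: Optional[str]) -> bool:
    """Treat a missing value or 'NOT AVAILABLE' as 'not a trauma center'."""
    return trauma is not None and trauma.strip().upper() != "NOT AVAILABLE"


# Hypothetical rows (NAME, BEDS, TRAUMA) used only to exercise the helpers.
rows = [
    ("HOSPITAL A", 49, "NOT AVAILABLE"),
    ("HOSPITAL B", -999, "LEVEL II"),
    ("HOSPITAL C", 127, None),
]

known_beds = [beds for _, beds, _ in rows if clean_beds(beds) is not None]
mean_known_beds = sum(known_beds) / len(known_beds)
trauma_centers = [name for name, _, trauma in rows if is_trauma_center(trauma)]

print(mean_known_beds)  # 88.0
print(trauma_centers)   # ['HOSPITAL B']
```

Leaving -999 in place would drag the mean bed count down sharply, so the unknown rows are excluded rather than treated as numbers.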

NYT Mask-Wearing Survey

The NYT Mask-Wearing Survey data set contains county-level estimates of mask usage in the US, derived from 250,000 survey responses. Each participant was asked “How often do you wear a mask in public when you expect to be within six feet of another person?” and given the choices of never, rarely, sometimes, frequently, or always. The survey was conducted from July 2 to July 14, 2020, and the data set was assembled by The New York Times and Dynata.

COUNTYFP  NEVER  RARELY  SOMETIMES  FREQUENTLY  ALWAYS
1001      0.053  0.074   0.134      0.295       0.444
1003      0.083  0.059   0.098      0.323       0.436
1005      0.067  0.121   0.120      0.201       0.491
1007      0.020  0.034   0.096      0.278       0.572
1009      0.053  0.114   0.180      0.194       0.459
1011      0.031  0.040   0.144      0.286       0.500

The COUNTYFP column is the FIPS code for the county, and each of the remaining columns is the estimated proportion of people in that county who responded with that option. Because the estimates are rounded, the values in each row sum to approximately one.
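
The claim that each row sums to about one can be checked directly on the six preview rows. The Python sketch below does so; the tolerance of 0.0025 is an assumption based on five values each rounded to three decimal places, and the helper names are hypothetical. It also pads COUNTYFP to five digits, since county FIPS codes are five characters long and the leading zero is lost when the code is read as a number.

```python
# (COUNTYFP, NEVER, RARELY, SOMETIMES, FREQUENTLY, ALWAYS) from the preview rows.
sample_rows = [
    (1001, 0.053, 0.074, 0.134, 0.295, 0.444),
    (1003, 0.083, 0.059, 0.098, 0.323, 0.436),
    (1005, 0.067, 0.121, 0.120, 0.201, 0.491),
    (1007, 0.020, 0.034, 0.096, 0.278, 0.572),
    (1009, 0.053, 0.114, 0.180, 0.194, 0.459),
    (1011, 0.031, 0.040, 0.144, 0.286, 0.500),
]

TOLERANCE = 0.0025  # assumed bound: five values, each off by at most 0.0005


def fips_label(county_fp: int) -> str:
    """Restore the leading zero that is lost when the code is read as a number."""
    return f"{county_fp:05d}"


def row_sum(row: tuple) -> float:
    """Sum the five response proportions, rounded to hide float noise."""
    return round(sum(row[1:]), 3)


sums = {fips_label(row[0]): row_sum(row) for row in sample_rows}
out_of_range = [fips for fips, total in sums.items() if abs(total - 1.0) > TOLERANCE]

for fips, total in sums.items():
    print(fips, total)
print("rows outside tolerance:", out_of_range)
```

For the preview rows the sums fall between 0.999 and 1.001, so no row is flagged.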


CDC COVID-19 Vaccinations in the United States

The COVID-19 Vaccinations in the United States data set contains the number of vaccine doses administered by state. Data on COVID-19 vaccine doses administered in the United States are collected by vaccination providers and reported to CDC through multiple sources, including jurisdictions, pharmacies, and federal entities, which use various reporting methods, including Immunization Information Systems, Vaccine Administration Management System, and direct data submission.

State              Total_Doses_Administered  Doses_Administered_per_100k  X18._Doses_Administered  X18._Doses_Administered_per_100K  Ratio_Doses_Administered
Alaska             239927                    32797                        238872                   43308                             0.32797
Alabama            815108                    16624                        814893                   21361                             0.16624
Arkansas           540192                    17900                        540003                   23300                             0.17900
American Samoa     18816                     33788                        18600                    42821                             0.33788
Arizona            1525794                   20962                        1524293                  27034                             0.20962
Bureau of Prisons  52743                     NA                           52740                    NA                                NA

  • Total doses administered column is the total number of vaccine doses that have been given to people.

  • Doses administered per 100k column is the total number of vaccine doses given for every 100,000 people.

  • 18+ Doses Administered column is the total number of vaccine doses that have been given to people aged 18 years and older.

  • 18+ Doses administered per 100k column is the total number of vaccine doses given for every 100,000 people aged 18 years and older.
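
The per-100k columns and the undocumented Ratio_Doses_Administered column appear to be related by a simple rescaling: in every preview row with data, the ratio equals Doses_Administered_per_100k divided by 100,000. The Python sketch below checks that relationship on the preview rows and also shows how a population figure is implied by the total and per-100k columns. This reading of the ratio column is an assumption drawn from the sample, and the function names are hypothetical.

```python
from typing import Optional


def doses_ratio(per_100k: Optional[float]) -> Optional[float]:
    """Rescale doses per 100,000 people to doses per person (None stays None)."""
    return None if per_100k is None else per_100k / 100_000


def implied_population(total_doses: int, per_100k: float) -> float:
    """Population implied by a total and its per-100,000 rate."""
    return total_doses * 100_000 / per_100k


# (State, Total_Doses_Administered, Doses_Administered_per_100k, Ratio) from the preview.
preview = [
    ("Alaska", 239927, 32797, 0.32797),
    ("Alabama", 815108, 16624, 0.16624),
    ("Arkansas", 540192, 17900, 0.17900),
    ("American Samoa", 18816, 33788, 0.33788),
    ("Arizona", 1525794, 20962, 0.20962),
    ("Bureau of Prisons", 52743, None, None),
]

ratio_matches = all(
    doses_ratio(per_100k) is None and ratio is None
    or abs(doses_ratio(per_100k) - ratio) < 1e-9
    for _, _, per_100k, ratio in preview
)
print("ratio column matches per_100k / 100000:", ratio_matches)
print("implied Alaska population:", round(implied_population(239927, 32797)))
```

Rows such as Bureau of Prisons have no population denominator, which is why their per-100k and ratio values are NA.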


Stay At Home Order Effectiveness For Each State

The “Infection rates before and after stay at home orders went into effect” data set lists each state and the date on which its first stay-at-home order went into effect. It also contains infection rates for the days before and after these orders were enacted. Infection rates were calculated using daily COVID-19 case counts collected by the Johns Hopkins Center for Health Security.

State        Order.date  Infection.rate.and.confidence.interval..before.order.  Infection.rate.and.confidence.interval..after.order.
Alabama      4/4/20      0.099 (0.088, 0.109)                                   0.042 (0.039, 0.045)
Alaska       3/28/20     0.11 (0.095, 0.126)                                    0.03 (0.027, 0.032)
Arizona      3/31/20     0.134 (0.124, 0.143)                                   0.03 (0.025, 0.036)
California   3/19/20     0.084 (0.077, 0.091)                                   0.055 (0.05, 0.06)
Colorado     3/26/20     0.11 (0.1, 0.121)                                      0.04 (0.035, 0.044)
Connecticut  3/23/20     0.154 (0.136, 0.172)                                   0.065 (0.059, 0.07)

  • State column is the name of each U.S. state for which data was available.

  • Order.date column is the date on which the first stay at home order was put into effect.

  • Infection.rate.and.confidence.interval..before.order. column is the infection rate and the confidence interval of this rate for the day before the order went into effect.

  • Infection.rate.and.confidence.interval..after.order. column is the infection rate and the confidence interval of this rate for the day after the order went into effect.
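
Each infection-rate cell packs three numbers into one string, such as "0.099 (0.088, 0.109)", so it has to be split before the rates can be compared. The Python sketch below shows one possible way to do this with a regular expression; RateWithCI and parse_rate are hypothetical names, and the non-overlap check at the end is only an informal comparison, not a statistical test used in the report.

```python
import re
from typing import NamedTuple


class RateWithCI(NamedTuple):
    rate: float
    lower: float
    upper: float


# One number, then "(lower, upper)"; an optional minus sign is allowed.
_NUMBER = r"(-?\d*\.?\d+)"
_PATTERN = re.compile(rf"^\s*{_NUMBER}\s*\(\s*{_NUMBER}\s*,\s*{_NUMBER}\s*\)\s*$")


def parse_rate(text: str) -> RateWithCI:
    """Split a cell like '0.099 (0.088, 0.109)' into rate, lower, and upper."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"unexpected rate format: {text!r}")
    return RateWithCI(*(float(group) for group in match.groups()))


# Alabama row from the preview table.
before = parse_rate("0.099 (0.088, 0.109)")
after = parse_rate("0.042 (0.039, 0.045)")

# The two intervals do not overlap when the 'after' upper bound
# lies below the 'before' lower bound.
intervals_disjoint = after.upper < before.lower

print(before)
print(after)
print("intervals do not overlap:", intervals_disjoint)  # True
```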


Subsection 2.2 Summary Analysis

JHU CSSE COVID-19 Data

Between 2020-01-22 and 2021-02-24, 28,336,097 total cases of COVID-19 were detected in the United States and 505,890 total deaths were ruled as being caused by COVID-19.

date                total_cases       total_deaths
Min.   :2020-01-22  Min.   :       1  Min.   :     0
1st Qu.:2020-04-30  1st Qu.: 1107214  1st Qu.: 67774
Median :2020-08-08  Median : 5022981  Median :163216
Mean   :2020-08-08  Mean   : 7786083  Mean   :177519
3rd Qu.:2020-11-16  3rd Qu.:11337674  3rd Qu.:249572
Max.   :2021-02-24  Max.   :28336097  Max.   :505890

As seen in the distributions of cases and deaths by state, California and Texas both appear as outliers, with higher numbers of both cases and deaths. However, the large populations of these states offer a possible explanation for the higher numbers. Additionally, epidemiological data suggest that more infectious and transmissible variants of SARS-CoV-2 may contribute to the high number of cases in these states.
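
Taking population into account usually means comparing rates rather than raw counts. The Python sketch below computes cases per 100,000 residents; per_100k is a hypothetical helper, and the two example states use made-up round numbers chosen only to show how a state with more cases can still have the lower rate. They are not figures from the data sets.

```python
def per_100k(count: int, population: int) -> float:
    """Return the number of cases (or deaths) per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count * 100_000 / population


# Made-up illustrative values, NOT taken from the JHU data.
large_state_rate = per_100k(count=3_000_000, population=40_000_000)
small_state_rate = per_100k(count=100_000, population=1_000_000)

print(large_state_rate)  # 7500.0
print(small_state_rate)  # 10000.0
```

In this illustration the larger state has thirty times as many cases but the lower per-capita rate, which is the effect the paragraph above describes.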


HIFLD Hospitals

As seen in the above visualizations of the geographic distributions of hospitals and trauma centers in the United States, health care institutions tend to be located around population centers. The distributions also show that larger states with larger populations have more hospitals and trauma centers and are more likely to have lower-level trauma centers. Additionally, lower-level trauma centers have, on average, more beds for patients than facilities with a higher trauma level.

BEDS
Min.   :   2.0
1st Qu.:  30.0
Median :  89.0
Mean   : 159.4
3rd Qu.: 223.0
Max.   :1592.0
NA's   :188

As seen in the box plot, there are quite a few outliers in the distribution of beds among trauma center levels. This is likely due to regional differences in population, as facilities in more highly populated areas need more beds for patients than those in rural areas. It is also likely that trauma centers are established based not on population but on geographic distance to another facility able to provide the same level of care.
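
To make the notion of an outlier concrete, the Python sketch below applies the conventional 1.5 × IQR rule to the overall BEDS quartiles from the summary above. Two assumptions apply: the box plot is taken to use this conventional rule, and the quartiles used here describe all hospitals together rather than each trauma level separately, so the fences are only indicative. The name tukey_fences is hypothetical.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple:
    """Return the (lower, upper) fences; points outside them count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Overall BEDS quartiles and maximum from the summary table.
Q1_BEDS, Q3_BEDS, MAX_BEDS = 30.0, 223.0, 1592.0

lower_fence, upper_fence = tukey_fences(Q1_BEDS, Q3_BEDS)

print(lower_fence, upper_fence)  # -259.5 512.5
print(MAX_BEDS > upper_fence)    # True
```

Under this rule any hospital with more than about 512 beds would be drawn as an outlier, so the largest facility, with 1,592 beds, lies far beyond the upper fence.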


NYT Mask-Wearing Survey

Averaged across counties, 51% of the responses are “Always” and 8% are “Never.” For a single county, the values for each response should sum to one. Because the values are rounded to three decimal places, the sum for each county actually ranges from 0.998 to 1.002.

NEVER            RARELY           SOMETIMES       FREQUENTLY      ALWAYS          sum
Min.   :0.00000  Min.   :0.00000  Min.   :0.0010  Min.   :0.0290  Min.   :0.1150  Min.   :0.998
1st Qu.:0.03400  1st Qu.:0.04000  1st Qu.:0.0790  1st Qu.:0.1640  1st Qu.:0.3932  1st Qu.:1.000
Median :0.06800  Median :0.07300  Median :0.1150  Median :0.2040  Median :0.4970  Median :1.000
Mean   :0.07994  Mean   :0.08292  Mean   :0.1213  Mean   :0.2077  Mean   :0.5081  Mean   :1.000
3rd Qu.:0.11300  3rd Qu.:0.11500  3rd Qu.:0.1560  3rd Qu.:0.2470  3rd Qu.:0.6138  3rd Qu.:1.000
Max.   :0.43200  Max.   :0.38400  Max.   :0.4220  Max.   :0.5490  Max.   :0.8890  Max.   :1.002
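
The observed range of the row sums, 0.998 to 1.002, is what rounding alone would produce. As a short worked bound, assume the five unrounded proportions p_i for a county sum to exactly one and that each published value is p_i rounded to three decimal places (both are assumptions; the survey methodology is not restated here):

```latex
\[
  \hat{p}_i = p_i + \varepsilon_i, \qquad
  |\varepsilon_i| \le 0.0005, \qquad
  \sum_{i=1}^{5} p_i = 1
\]
\[
  \left| \sum_{i=1}^{5} \hat{p}_i - 1 \right|
  = \left| \sum_{i=1}^{5} \varepsilon_i \right|
  \le \sum_{i=1}^{5} |\varepsilon_i|
  \le 5 \times 0.0005 = 0.0025
\]
```

Since every rounded value is a multiple of 0.001, so is their sum, and the only multiples of 0.001 within 0.0025 of one are 0.998 through 1.002, matching the minimum and maximum in the sum column.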

There do not seem to be any significant outliers. This is probably because there were 250,000 survey responses for a survey with only five options, so the responses in an individual county would have to differ greatly from the rest for that county to become an outlier. Outliers are also less likely because the data set is grouped by county, which forces the columns of each row to sum to one. There are no NA values, and the data set appears to cover almost every county.


CDC COVID-19 Vaccinations in the United States

As of February 22, 2021, 68,150,728 vaccine doses had been administered in the US. Averaged across states, 21,242 doses had been administered per 100,000 people (21.2415%). The number of doses administered per 100,000 people ranges from 11,767 to 39,499.

Total_Doses_Administered  Doses_Administered_per_100k  X18._Doses_Administered  X18._Doses_Administered_per_100K
Min.   :   7073           Min.   :11767                Min.   :   7073          Min.   :15081
1st Qu.: 241471           1st Qu.:18891                1st Qu.: 240832          1st Qu.:24127
Median : 614928           Median :19881                Median : 614420          Median :25428
Mean   :1097822           Mean   :21231                Mean   :1096961          Mean   :27224
3rd Qu.:1396224           3rd Qu.:22824                3rd Qu.:1395704          3rd Qu.:28548
Max.   :7728120           Max.   :39499                Max.   :7724412          Max.   :50641

The most significant outlier in the data set is the total number of doses administered in California. One possible reason is that the population of California is large, so a great number of people are receiving the vaccine; a high overall education level in the state may also contribute.


Section 3. Specific Question Analysis